9  Classification and Prediction

Learning Objectives

By the end of this chapter, you should be able to:

  1. Explain the difference between prediction and explanation, and describe why the distinction matters in modelling.
  2. Describe why we split data into training and testing sets, and explain how this helps detect overfitting.
  3. Fit and interpret a linear probability model (LPM), including interpreting coefficients as changes in probability.
  4. Use predicted probabilities to make classifications by applying a threshold, and explain how changing the threshold changes model decisions.
  5. Construct and interpret a confusion matrix, identifying TP, TN, FP, FN, and compute performance measures such as accuracy, false positive rate (FPR), and false negative rate (FNR).

9.1 Introduction

Over the past few chapters, we have focused on linear models, where the outcome of interest is numeric. Using tools such as simple and multiple linear regression, we learned how to quantify relationships between variables, interpret model coefficients, and use data to make predictions about continuous outcomes — such as income ($), age (years), reaction time (secs), or performance scores (points).

However, not all real-world questions involve predicting numbers. In many situations, our goal is to predict a category, group, or decision rather than a numerical value. For example:

  • Will a customer default or not default on a loan? (yes or no)
  • Will a student pass or fail a unit? (pass or fail)
  • Is an athlete selected or not selected for a team? (selected or not selected)

These problems fall under the umbrella of classification, where the outcome variable is categorical rather than numeric. In this chapter, we extend the ideas of modelling and prediction to these types of problems. While the goal changes—from predicting a number to predicting a class—the underlying principles remain familiar. We still:

  • use predictors (explanatory variables),
  • build models based on observed data,
  • evaluate how well those models perform, and
  • use them to make predictions for new observations.

In this chapter, we take a first step into classification by using a familiar tool in an unfamiliar way: linear regression. Specifically, we introduce the linear probability model (LPM), where a binary outcome is modelled using a linear regression framework. The LPM allows us to interpret coefficients as changes in probability, making it an intuitive and accessible entry point into classification modelling—especially given our existing understanding of linear models.

However, while the linear probability model is simple and easy to interpret, it comes with important limitations, such as:

  • the possibility of predicted probabilities falling outside the 0–1 range,
  • violations of key regression assumptions, and
  • challenges in interpreting uncertainty and model fit.

Understanding why the linear probability model breaks down is just as important as knowing how to fit it. These limitations motivate the need for more appropriate models for binary outcomes. In the next chapter, we will build on this foundation by introducing logistic regression, a model specifically designed for classification problems. Logistic regression resolves many of the issues encountered with the linear probability model while retaining a clear interpretation in terms of probabilities, making it a central tool in statistics, econometrics, and data science.

9.2 A quick recap: Linear regression

Before moving into classification models, we briefly revisit linear regression as this will be crucial for understanding classification.

Suppose we are interested in predicting goal difference (also called Margin) in a football match. A positive margin means the home team wins by that many goals, while a negative margin indicates a loss. As a simple starting point, we assume that the only predictor of match outcome is the difference in team ratings (Ratings Difference) prior to the game.

We observe the following data:

Margin   Ratings Difference
  -20          -17
  -17          -30
    4            0
    6           18
   40           31
   49           46

In this particular case, we can express the relationship between these two variables as:

\[\text{Margin}=\beta_0+\beta_1(\text{Ratings Difference})\]

Theoretically this model makes sense, as larger rating differences should have greater margins (imagine an elite adult level team vs a group of 5-year-old children - the difference in ratings would be huge, and therefore we would predict a much larger margin).

If we were to plot this data it might look like the image below (where the dashed blue line represents the trend line).

And using the lm() function in R, we can easily see what the values of the \(\beta\) coefficients would be in this model:

R Code
lm(Margin ~ Diff, df1)

Call:
lm(formula = Margin ~ Diff, data = df1)

Coefficients:
(Intercept)         Diff  
     2.8088       0.9406  

From the output above, our fitted model would be:

\[\text{Margin}=2.81+0.94(\text{Ratings Difference})\]
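To reproduce this fit yourself, the six matches from the table can be typed in directly. The sketch below is one way to do that; the object and column names (df1, Margin, Diff) follow the lm() call shown above, but how the original df1 was created is not given in the text, so the data.frame() step is an assumption.

```r
# Rebuild the six matches from the table (construction of df1 is assumed)
df1 <- data.frame(
  Diff   = c(-17, -30, 0, 18, 31, 46),   # Ratings Difference
  Margin = c(-20, -17, 4, 6, 40, 49)     # goal difference
)

# Fit the simple linear regression of Margin on Ratings Difference
fit1 <- lm(Margin ~ Diff, data = df1)

# Intercept and slope, rounded to two decimal places
round(coef(fit1), 2)
```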

9.3 Explain versus Predict

In statistics, it’s important to be clear about what we are trying to achieve. Broadly, statistical models are used for two related but distinct purposes: explanation and prediction.

9.3.1 Explanation

Explanation is about understanding why things happen. An explanatory model focuses on the relationship between variables, aiming to isolate and interpret the effect of one variable on another. For example, we might ask: How much does a one-unit increase in team rating difference change the expected goal margin, on average? From our model above, we can see that the slope for Ratings Difference (\(\beta_1\)) is 0.94. This helps us to explain the relationship between ratings difference and margin, i.e. for every one-unit increase in ratings difference, the expected margin increases by 0.94 goals.

9.3.2 Prediction

Prediction, by comparison, is about using a model to generate predictions for future or unseen observations. The primary goal is not to interpret individual coefficients, but to produce accurate forecasts of an outcome based on available information. For instance, we might use team rating differences to predict the goal margin in an upcoming match.

For example, suppose two teams are due to play a match next week. The ratings for both teams are as follows:

  • \(Rating_\text{Team A}=1530\)
  • \(Rating_\text{Team B}=1510\)

In this case, the Ratings Difference is \(1530-1510=20\). Therefore, using our model, we would predict the Margin to be:

\[ \begin{aligned} \text{Margin} &=2.81 + 0.94(\text{Ratings Difference}) \\ &=2.81 + 0.94(20) \\ &=21.61 \end{aligned} \]
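The same prediction can be obtained in R with predict(). This is a minimal sketch that refits the model on the six matches from Section 9.2 (the data.frame() construction is an assumption) and then asks for the predicted Margin at a Ratings Difference of 20. Because R works with the unrounded coefficients, the result can differ from the hand calculation in the second decimal place.

```r
# Six matches from Section 9.2 (construction of df1 is assumed)
df1 <- data.frame(
  Diff   = c(-17, -30, 0, 18, 31, 46),
  Margin = c(-20, -17, 4, 6, 40, 49)
)
fit1 <- lm(Margin ~ Diff, data = df1)

# Ratings Difference for the upcoming match: 1530 - 1510
upcoming <- data.frame(Diff = 1530 - 1510)

# Predicted goal margin for the new match
predicted_margin <- predict(fit1, newdata = upcoming)
round(predicted_margin, 2)
```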

9.4 Training and Testing Sets

So far, when fitting a regression model, we have used all available data to estimate the relationship between predictors and the outcome. While this is useful for understanding relationships, it does not tell us how well the model will perform on new, unseen data.

In prediction and classification problems, this distinction is crucial. Our goal is not just to explain the past, but to make accurate predictions about the future.

9.4.1 Training data

The training dataset is the portion of data used to fit the model. This is where the model learns the relationship between variables—for example, how differences in team ratings relate to goal margin.

Using training data, we:

  • estimate model parameters (such as the intercept and slope),
  • choose model structure, and
  • understand how predictors influence the outcome.

However, a model that fits the training data very well may still perform poorly on new data. This phenomenon is known as overfitting, where the model captures noise or idiosyncrasies specific to the training sample rather than the underlying pattern. For example, let’s use our current data to fit two models: (1) a linear model and (2) a complicated non-linear model that connects all of the data.

9.4.2 Testing data

The testing dataset is held back during model fitting and is used only to evaluate model performance. Because the model has never “seen” this data before, performance on the test set provides a more realistic measure of how well the model is likely to perform in practice. For example, suppose we have 4 games (highlighted in green below) that were held back during training the model. We can visualise their deviations from the models as the solid black lines:

If we add up the sizes of these errors (the lengths of the solid black lines in the image below), we find that the linear model has the smaller total error on the held-out games, which means it makes better predictions.

So even though the complicated model fitted the training data better, the linear model was better at making predictions. If we had to choose between these two models for making predictions, we would go with the linear one. Note: a model that fits the training data very well but predicts poorly on new data is overfitting, which is one side of what is known as the bias-variance trade-off.
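The idea can be tried out on simulated data. The sketch below is purely illustrative: the data are made up (a linear relationship plus random noise, with an arbitrary seed), the split into the first 15 and last 15 rows is arbitrary, and the degree-8 polynomial simply stands in for a "complicated" model. The flexible model will always fit the training data at least as well as the straight line; its error on the test rows is usually larger, although with a single small sample it can go either way.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Made-up matches: an assumed linear relationship plus random noise
n      <- 30
Diff   <- runif(n, min = -30, max = 50)
Margin <- 2.8 + 0.94 * Diff + rnorm(n, sd = 10)
sim    <- data.frame(Diff, Margin)

train <- sim[1:15, ]    # used to fit the models
test  <- sim[16:30, ]   # held back for evaluation only

simple_fit   <- lm(Margin ~ Diff, data = train)
flexible_fit <- lm(Margin ~ poly(Diff, 8), data = train)

# Root mean squared error of a model on a given data set
rmse <- function(model, data) {
  sqrt(mean((data$Margin - predict(model, newdata = data))^2))
}

errors <- data.frame(
  model      = c("linear", "degree-8 polynomial"),
  train_rmse = c(rmse(simple_fit, train), rmse(flexible_fit, train)),
  test_rmse  = c(rmse(simple_fit, test),  rmse(flexible_fit, test))
)
errors
```

Comparing the train_rmse and test_rmse columns shows how much each model's performance changes when it meets data it has not seen.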

9.5 Linear Probability Model (LPM)

In the previous section we looked at an example where the outcome (Margin) was numeric. The table below shows the same data with a few extra data points.

Ratings Difference   Margin
       -17             -20
       -30             -17
         0               4
        18               6
        31              40
        46              49
        10              -1
         5             -16
        20              -9

Suppose that instead of modelling Margin, we were interested in a binary outcome such as Win / Loss:

Ratings Difference   Margin   Win
       -17             -20    Loss
       -30             -17    Loss
         0               4    Win
        18               6    Win
        31              40    Win
        46              49    Win
        10              -1    Loss
         5             -16    Loss
        20              -9    Loss

Usually these outcomes are coded as 0s and 1s (here 0 = Loss and 1 = Win) to make computation a bit easier:

Ratings Difference   Margin   Win
       -17             -20     0
       -30             -17     0
         0               4     1
        18               6     1
        31              40     1
        46              49     1
        10              -1     0
         5             -16     0
        20              -9     0

Now, if we tried to plot Ratings Difference and Win, we would obtain the Figure below. Notice here that because our outcome is binary, we can only have y values of 0 (Loss) or 1 (Win):

As in our previous examples, we can fit a linear regression model to these data:

R Code
lm(Win ~ Diff, df4)

Call:
lm(formula = Win ~ Diff, data = df4)

Coefficients:
(Intercept)         Diff  
    0.32123      0.01336  

\[\text{Win}=0.321+0.013(\text{Ratings Difference})\]
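The fit above can be reproduced from the nine matches in the table. In this sketch the names df4, Win and Diff follow the lm() call in the text, while the data.frame() construction is an assumption about how df4 was built.

```r
# Nine matches with the binary outcome coded 0 = Loss, 1 = Win
# (construction of df4 is assumed)
df4 <- data.frame(
  Diff = c(-17, -30, 0, 18, 31, 46, 10, 5, 20),
  Win  = c(  0,   0, 1,  1,  1,  1,  0, 0,  0)
)

# Linear probability model: ordinary linear regression on a 0/1 outcome
lpm <- lm(Win ~ Diff, data = df4)
round(coef(lpm), 3)
```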

If Win was a numeric outcome, we would interpret the coefficient in the same way as in standard linear regression. For example:

For each additional one-point increase in Ratings Difference, Win increases by 0.013.

However, this interpretation does not make much sense in this context. The reason is that Win is a binary variable — it can only take the values 0 (loss) or 1 (win). There is no meaningful way for Win to increase by 0.013 in an observed match; a team either wins or it does not.

Instead, when using a linear probability model (LPM), we reinterpret the outcome variable. Although Win is coded as 0 or 1, the expected value of Win represents a probability:

\[E(\text{Win})=P(\text{Win}=1)\]

This means the model is not predicting individual wins or losses directly, but rather the probability of winning. Under this interpretation, the coefficient is understood as a change in probability:

For each additional point in Ratings Difference, the predicted probability of winning (Win = 1) increases by 0.013, that is, by about 1.3 percentage points.

This interpretation is intuitive and one of the main reasons the linear probability model is often used as a first step in classification problems. It allows us to:

  • interpret coefficients in a familiar linear regression framework, and
  • directly link predictors to changes in probability.
Caution

However, this convenience comes at a cost. Because the model is linear, it can predict probabilities less than 0 or greater than 1, which are not meaningful. For example, suppose we had the following scenario:

  • \(Rating_\text{Team A}=1500\)
  • \(Rating_\text{Team B}=1425\)

In this case, the Ratings Difference is \(1500-1425=75\). Therefore, using our model, we would predict the chance of winning to be:

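\[ \begin{aligned} \text{Win} &=0.32123 + 0.01336(\text{Ratings Difference}) \\ &=0.32123 + 0.01336(75) \\ &\approx 1.323 \end{aligned} \]

Here we have used the unrounded coefficients from the R output; the rounded values 0.321 and 0.013 give 1.296, which is also greater than 1. Clearly, a probability above 1 is not possible (likewise, the model can also produce negative probability values). In the next chapter, we will learn about a statistical model, logistic regression, that can deal with this limitation.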
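We can check this behaviour with predict(). The sketch below refits the linear probability model on the nine matches (the data.frame() construction is an assumption) and asks for predictions at a Ratings Difference of 75 and of -30, to show one value on each side of the valid 0 to 1 range.

```r
# Nine matches from the table (construction of df4 is assumed)
df4 <- data.frame(
  Diff = c(-17, -30, 0, 18, 31, 46, 10, 5, 20),
  Win  = c(  0,   0, 1,  1,  1,  1,  0, 0,  0)
)
lpm <- lm(Win ~ Diff, data = df4)

# A large positive and a large negative Ratings Difference
new_matches <- data.frame(Diff = c(1500 - 1425, -30))
p <- predict(lpm, newdata = new_matches)
round(p, 3)

# Flag predictions that are not valid probabilities
p < 0 | p > 1
```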

9.6 Thresholds

When using models for classification, the output is often a probability, not a direct class label. For example, a linear probability model might estimate that a team has a 0.63 probability of winning a match. On its own, this probability is informative — but in many practical situations, we must go one step further and make a decision: win or loss, approve or reject, intervene or not. To turn predicted probabilities into class labels, we must choose a classification threshold. A threshold specifies the probability value above which we predict one class (typically coded as 1) and below which we predict the other (coded as 0). A common default is 0.5, but this choice is not universal and is often not optimal.

Let’s revisit the example from earlier. Using the derived model we can determine the predicted probability of Win for each observation.

Let’s also plot these predictions (see Figure below). A threshold is then chosen (default = 0.50) to help us classify the predictions. Predictions below the threshold are given a predicted classification of 0, whereas predictions at or above the threshold are given a predicted classification of 1.

Using this threshold, we can see which observations would be classified as 0 and which would be classified as 1:

Win   R.Diff   Prediction   Class (Threshold = 0.50)
 0     -17        0.094                0
 0     -30       -0.080                0
 1       0        0.321                0
 1      18        0.562                1
 1      31        0.735                1
 1      46        0.936                1
 0      10        0.455                0
 0       5        0.388                0
 0      20        0.588                1
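The thresholding step can be written in a couple of lines of R. This is a minimal sketch: it refits the linear probability model on the nine matches (the data.frame() construction and the column names Prediction and Class are our own choices), then labels a prediction as 1 when it is at or above the threshold and 0 otherwise.

```r
# Nine matches from the table (construction of df4 is assumed)
df4 <- data.frame(
  Diff = c(-17, -30, 0, 18, 31, 46, 10, 5, 20),
  Win  = c(  0,   0, 1,  1,  1,  1,  0, 0,  0)
)
lpm <- lm(Win ~ Diff, data = df4)

threshold <- 0.50

# Predicted probability of a win for each observed match
df4$Prediction <- round(fitted(lpm), 3)

# Predicted class: 1 if at or above the threshold, otherwise 0
df4$Class <- ifelse(fitted(lpm) >= threshold, 1, 0)
df4
```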

To see how well the model did at making predictions with this threshold, we can compare the actual outcome (Win) to the predicted class. In most cases, the actual and predicted outcomes are the same. In some cases, however, they differ, which means that the prediction is incorrect. Specifically, we can see two predictions at this threshold that are incorrect:

Win   R.Diff   Prediction   Class (Threshold = 0.50)   Correct Prediction?
 0     -17        0.094                0                      Yes
 0     -30       -0.080                0                      Yes
 1       0        0.321                0                      No
 1      18        0.562                1                      Yes
 1      31        0.735                1                      Yes
 1      46        0.936                1                      Yes
 0      10        0.455                0                      Yes
 0       5        0.388                0                      Yes
 0      20        0.588                1                      No

When evaluating a binary classification model, results are often summarised in a 2×2 table, sometimes called a confusion matrix. This table compares what the model predicted with what actually occurred. Each cell in the table corresponds to a specific type of outcome, and understanding these outcomes is essential for interpreting model performance.

                  Actual
Predicted     Win    Loss    Total
Win             3       1        4
Loss            1       4        5
Total           4       5        9

From this table, we can easily see that:

  • Of the 4 predicted ‘Win’ outcomes, 3 were correct and 1 was incorrect
  • Of the 5 predicted ‘Loss’ outcomes, 4 were correct and 1 was incorrect
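In R, a confusion matrix like this can be built with table(). The sketch below types in the actual outcomes and the predicted classes at the 0.50 threshold for the nine matches; the helper to_label() is our own convenience function for turning 0/1 into Loss/Win labels, not a built-in.

```r
# Actual outcomes and predicted classes (threshold = 0.50) for the nine matches
actual    <- c(0, 0, 1, 1, 1, 1, 0, 0, 0)
predicted <- c(0, 0, 0, 1, 1, 1, 0, 0, 1)

# Hypothetical helper: convert 0/1 codes to labelled factors, Win listed first
to_label <- function(x) factor(x, levels = c(1, 0), labels = c("Win", "Loss"))

# Rows are predictions, columns are what actually happened
cm <- table(Predicted = to_label(predicted), Actual = to_label(actual))

# Add row and column totals
addmargins(cm)
```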

9.7 Type I and Type II Errors

Whenever we use a model to make a classification decision, there is the possibility of making a mistake. In binary classification problems, these mistakes fall into two distinct types, known as Type I and Type II errors. Understanding the difference between them is essential, because not all errors have the same consequences.

A Type I error occurs when the model predicts a positive outcome when the true outcome is negative. In other words, we classify an observation as belonging to class 1 when it actually belongs to class 0. In the example above, this would mean predicting a win when the team in fact loses. This type of mistake is commonly called a false positive.

A Type II error occurs in the opposite situation: when the model predicts a negative outcome even though the true outcome is positive. That is, we classify an observation as class 0 when it truly belongs to class 1. In our football example, this corresponds to predicting a loss when the team actually wins. This type of mistake is commonly called a false negative.

In our current example:

            Actual Win   Actual Loss
Pred Win         3            1
Pred Loss        1            4

            Actual Win       Actual Loss
Pred Win    True Positive    False Positive
Pred Loss   False Negative   True Negative

Note

True Positives (TP)

A true positive occurs when the model predicts a positive outcome and the actual outcome is also positive. In other words, the model correctly identifies a case that truly belongs to class 1. For example, in a football context, a true positive would be predicting a win when the team actually wins. True positives represent correct positive classifications.

False Positives (FP)

A false positive occurs when the model predicts a positive outcome, but the actual outcome is negative. Here, the model signals a win when the team in fact loses. False positives are often associated with Type I errors. While the model appears confident, it has incorrectly classified a negative case as positive.

True Negatives (TN)

A true negative occurs when the model predicts a negative outcome and the actual outcome is also negative. This corresponds to correctly predicting a loss when the team does indeed lose. True negatives represent correct negative classifications, and together with true positives, they contribute to overall model accuracy.

False Negatives (FN)

A false negative occurs when the model predicts a negative outcome even though the actual outcome is positive. In this case, the model predicts a loss when the team actually wins. False negatives are associated with Type II errors and reflect situations where the model fails to detect a true positive outcome.

Confusion matrices can be useful for comparing the classification accuracy of different models. For example, let’s compare the two models below, which generated the following confusion matrices:

Model 1
            Actual Win   Actual Loss
Pred Win       242           47
Pred Loss       51          210

Model 2
            Actual Win   Actual Loss
Pred Win        51          248
Pred Loss      263           62

In this example, it is clear that Model 1 (more True Positives and True Negatives) is better than Model 2 (which has more False Positives and False Negatives).

But sometimes it isn’t so obvious. For example, consider the following models and their respective confusion matrices:

Model 1
            Actual Win   Actual Loss
Pred Win       242           47
Pred Loss       51          210

Model 3
            Actual Win   Actual Loss
Pred Win       239           40
Pred Loss       32          141

This is where the following formulas can help us in assessing the errors in our predictions.

9.7.1 False Positive Rate (Type I Error)

The false positive rate (also called Type I Error) tells us the proportion of negative cases that were incorrectly predicted as positive (in this example, losses incorrectly predicted as wins), and can be calculated with the following formula:

\[\text{False Positive Rate (Type I Error)}=\frac{\text{False Positive}}{\text{False Positive + True Negative}}\]

If we consider Model1 and Model3 above, the False Positive rate for these two would be:

\[ \begin{aligned} \text{False Positive Rate}_\text{Model 1} &= \frac{47}{47+210} \\ &= \frac{47}{257} \\ &= 0.183 \end{aligned} \]

\[ \begin{aligned} \text{False Positive Rate}_\text{Model 3} &= \frac{40}{40+141} \\ &= \frac{40}{181} \\ &= 0.221 \end{aligned} \]

Here we can see that Model 1 has a lower False Positive Rate, which means that a smaller share of actual losses were incorrectly predicted as wins.

9.7.2 False Negative Rate (Type II Error)

We also have the false negative rate (also called Type II Error), which tells us the proportion of positive cases that were incorrectly predicted as negative (i.e. wins incorrectly predicted as losses). The False Negative Rate is given by:

\[\text{False Negative Rate (Type II Error)}=\frac{\text{False Negative}}{\text{False Negative + True Positive}}\]

For our two models, the False Negative Rates are:

\[ \begin{aligned} \text{False Negative Rate}_\text{Model 1} &= \frac{51}{51+242} \\ &= \frac{51}{293} \\ &= 0.174 \end{aligned} \]

\[ \begin{aligned} \text{False Negative Rate}_\text{Model 3} &= \frac{32}{32+239} \\ &= \frac{32}{271} \\ &= 0.118 \end{aligned} \]

Here we can see that Model 3 has a lower False Negative Rate, which means that a smaller share of actual wins were incorrectly predicted as losses.
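Both rates are easy to compute in R once the four counts are stored. In the sketch below, the counts are copied from the confusion matrices for Model 1 and Model 3; the helper functions fpr() and fnr() are our own small definitions, not built-in R functions.

```r
# Counts copied from the confusion matrices above
model1 <- c(TP = 242, FP = 47, FN = 51, TN = 210)
model3 <- c(TP = 239, FP = 40, FN = 32, TN = 141)

# Hypothetical helpers implementing the two formulas
fpr <- function(m) unname(m["FP"] / (m["FP"] + m["TN"]))  # false positive rate
fnr <- function(m) unname(m["FN"] / (m["FN"] + m["TP"]))  # false negative rate

# False Positive Rates
round(c(Model1 = fpr(model1), Model3 = fpr(model3)), 3)

# False Negative Rates
round(c(Model1 = fnr(model1), Model3 = fnr(model3)), 3)
```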

9.7.3 Choosing Models based on Type I/II Errors

If we compare the two models above, we can see that:

  • The False Positive Rate is better for Model 1 than Model 3.
  • By comparison, the False Negative Rate is better for Model 3 than Model 1.

Which model we choose depends on what types of mistakes matter most in the context of the decision being made. In many real-world situations, decisions are binary, and the outcome of a decision can be either correct or incorrect when compared to what actually happens. These incorrect decisions map directly onto the ideas of false positives and false negatives that we have just introduced.

Consider the two situations below. In each case, a decision must be made, and an incorrect decision can occur in one of two ways. However, the consequences of these errors are not the same. As you read through each scenario, think carefully about which type of error would be more acceptable—or less harmful—to make, and why. This exercise highlights an important idea in classification: there is rarely a single “best” model in all situations. Instead, the most appropriate model is the one that balances errors in a way that aligns with the goals and risks of the problem at hand.


Consider a court case situation where a defendant is on trial. The defendant is either Guilty or Not Guilty of the charges laid against them, and the judge has to make a decision (guilty or not guilty) based on the evidence presented:

  • A correct guilty verdict would be a TRUE POSITIVE
  • A correct not guilty verdict would be a TRUE NEGATIVE
  • However, if the judge concludes Guilty, but the defendant was actually Innocent, then this would be a FALSE POSITIVE
  • Conversely, if the judge concludes Not guilty, but the defendant actually was Guilty, then this would be a FALSE NEGATIVE.

In this situation, the consequences of incorrect decisions would be:

  • False Positive: Innocent person goes to jail
  • False Negative: Guilty person walks free

In your opinion, which error is worse to make?


Let us now consider a different example that involves cancer patients. Suppose a patient has been advised to take a test to screen for lung cancer, and the test will tell the patient one of two outcomes (cancer or no cancer). Here:

  • A correct cancer diagnosis would be a TRUE POSITIVE
  • A correct non-cancer diagnosis would be a TRUE NEGATIVE
  • If the test says ‘Cancer’, but the patient doesn’t have Cancer, then this would be a FALSE POSITIVE
  • If the test says ‘No Cancer’, but the patient does have Cancer, then this would be a FALSE NEGATIVE

In this situation, the consequences of incorrect test diagnosis would be:

  • False Positive: Stress, Worrying
  • False Negative: Condition becomes worse (because not treated early)

In your opinion, which error is worse to make?


In the two examples above, we can see that different incorrect decisions can have very different impacts. If missing a true positive is the more serious mistake (e.g. failing to detect cancer), then we would prefer a model with a low False Negative Rate, even if that means accepting more false positives. On the other hand, if it is preferable to have a guilty person walk free (as opposed to an innocent person going to jail), then we would prefer a model with a low False Positive Rate, even at the cost of more false negatives.

9.7.4 Overall Accuracy

Apart from the false positive and false negative rates, we can also calculate overall accuracy, which measures the proportion of all predictions that are correct.

\[\text{Overall Accuracy}=\frac{\text{True Positive + True Negative}}{\text{True Positive + False Positive + True Negative + False Negative}}\]

While accuracy is simple and intuitive, it is often less informative on its own. In practice, it is more common to keep these measures separate because accuracy hides the type of mistakes being made. Two models can have the same overall accuracy but very different balances of false positives and false negatives. In many real-world applications, these errors do not carry the same consequences, and treating them as equally important can lead to poor decision-making.

By examining false positive and false negative rates separately, we gain a clearer understanding of how a model is making errors, not just how often it is wrong. This allows us to choose models and classification thresholds that align with the priorities and risks of the problem, rather than relying on a single summary number.
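As a quick illustration, the accuracy formula can be applied to Model 1 and Model 3 from earlier. The counts below are copied from their confusion matrices; accuracy() is our own small helper, not a built-in function.

```r
# Counts copied from the confusion matrices for Model 1 and Model 3
model1 <- c(TP = 242, FP = 47, FN = 51, TN = 210)
model3 <- c(TP = 239, FP = 40, FN = 32, TN = 141)

# Hypothetical helper: correct predictions divided by all predictions
accuracy <- function(m) unname((m["TP"] + m["TN"]) / sum(m))

round(c(Model1 = accuracy(model1), Model3 = accuracy(model3)), 3)
```

The two accuracy values say nothing about which kind of mistake each model tends to make, which is why the false positive and false negative rates are worth reporting alongside them.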

9.8 Choosing different thresholds

In the examples so far, we have used a default threshold of 0.5 to convert predicted probabilities into class labels. That is, if the predicted probability of a win is greater than or equal to 0.5, we predict Win; otherwise, we predict Loss. While this rule is common, it is not mandatory and is often not the best choice.

The classification threshold controls the trade-off between false positives and false negatives. Lowering the threshold makes it easier for observations to be classified as positive, which typically increases the number of true positives but also increases false positives. Raising the threshold has the opposite effect: fewer positives are predicted, reducing false positives but increasing false negatives.

Which threshold is most appropriate depends on the context of the decision. In situations where false positives are costly—such as incorrectly convicting an innocent person or falsely flagging a legitimate transaction as fraud—a higher threshold may be preferred. In contrast, when missing a true positive is more serious—such as failing to detect a medical condition or overlooking a high-risk case—a lower threshold may be more appropriate.

Importantly, changing the threshold does not change the underlying model or the predicted probabilities. It only changes how those probabilities are translated into decisions. This means that even with the same model, different stakeholders might reasonably choose different thresholds depending on their priorities and tolerance for risk.

As an example, let us consider the data from earlier. In the figure below, the same data are plotted with three different threshold values (shown as dashed lines): 0.25, 0.50 and 0.75.

Now depending on which threshold we decide on, how we classify each observation can vary:

Actual   Classification (0.25)   Classification (0.50)   Classification (0.75)
   0               0                       0                       0
   0               0                       0                       0
   1               1                       0                       0
   1               1                       1                       0
   1               1                       1                       0
   1               1                       1                       1
   0               1                       0                       0
   0               1                       0                       0
   0               1                       1                       0

If we compare each classification to the actual value, then we can see which predictions were correct under each of these three thresholds:

Actual   Classification (0.25)   Classification (0.50)   Classification (0.75)
   0          Correct                 Correct                 Correct
   0          Correct                 Correct                 Correct
   1          Correct                 Incorrect               Incorrect
   1          Correct                 Correct                 Incorrect
   1          Correct                 Correct                 Incorrect
   1          Correct                 Correct                 Correct
   0          Incorrect               Correct                 Correct
   0          Incorrect               Correct                 Correct
   0          Incorrect               Incorrect               Correct

This allows us to create confusion matrices for each threshold:

Threshold = 0.25
            Actual Win   Actual Loss
Pred Win         4            3
Pred Loss        0            2

Threshold = 0.50
            Actual Win   Actual Loss
Pred Win         3            1
Pred Loss        1            4

Threshold = 0.75
            Actual Win   Actual Loss
Pred Win         1            0
Pred Loss        3            5

Finally, we can compute the False Positive and False Negative rates to help us decide which threshold to choose.

For the False Positive Rates:

\[ \begin{aligned} \text{False Positive Rate}_\text{(0.25)} &= \frac{3}{3+2}=0.60 \\ \end{aligned} \]

\[ \begin{aligned} \text{False Positive Rate}_\text{(0.50)} &= \frac{1}{1+4}=0.20 \\ \end{aligned} \]

\[ \begin{aligned} \text{False Positive Rate}_\text{(0.75)} &= \frac{0}{0+5}=0.00 \\ \end{aligned} \]

And for the False Negative Rates:

\[ \begin{aligned} \text{False Negative Rate}_\text{(0.25)} &= \frac{0}{0+4}=0.00 \\ \end{aligned} \]

\[ \begin{aligned} \text{False Negative Rate}_\text{(0.50)} &= \frac{1}{1+3}=0.25 \\ \end{aligned} \]

\[ \begin{aligned} \text{False Negative Rate}_\text{(0.75)} &= \frac{3}{3+1}=0.75 \\ \end{aligned} \]

Note here that because we are dealing with small samples, some of the error values are 0 - which is highly unlikely if we were to use a real data set!
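The whole comparison can be automated by looping over the candidate thresholds. The sketch below is one way to do it: the predicted probabilities are computed from the coefficients reported in the R output earlier (0.32123 and 0.01336), and rates_at() is our own helper that counts the four outcomes and returns both error rates for a given threshold.

```r
# Actual outcomes and Ratings Differences for the nine matches
actual <- c(0, 0, 1, 1, 1, 1, 0, 0, 0)
diff   <- c(-17, -30, 0, 18, 31, 46, 10, 5, 20)

# Predicted probabilities from the fitted linear probability model
prob <- 0.32123 + 0.01336 * diff

# Hypothetical helper: confusion counts and error rates at one threshold
rates_at <- function(threshold) {
  predicted <- as.integer(prob >= threshold)
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  tn <- sum(predicted == 0 & actual == 0)
  c(threshold = threshold, TP = tp, FP = fp, FN = fn, TN = tn,
    FPR = fp / (fp + tn), FNR = fn / (fn + tp))
}

# One row per threshold
results <- t(sapply(c(0.25, 0.50, 0.75), rates_at))
results
```

Reading down the FPR and FNR columns shows the trade-off directly: as the threshold rises, one rate falls while the other climbs.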

So how do we decide which classification threshold to use? Rather than relying on a single default value, one approach is to examine how model performance changes across all possible thresholds. This idea underpins tools such as the Receiver Operating Characteristic (ROC) curve, which plots the trade-off between the true positive rate and the false positive rate as the threshold varies. The Area Under the Curve (AUC) then provides a single summary of how well the model separates the two classes, independent of any specific threshold choice.

While ROC curves and AUC are widely used in practice, particularly in machine learning and diagnostic testing, they go beyond the scope of this unit. Instead, our focus in this unit is on understanding the conceptual trade-off between false positives and false negatives and choosing thresholds based on the context and consequences of decision-making. This perspective equips us to make informed, transparent classification choices, even without advanced optimisation tools.

9.9 Practice with R

Let’s now work through a scenario that includes a proper data set.

9.9.1 Context

In poor, developing countries, the government often wants to target support to poor households to help them with basic needs such as food, health care, and education. To make the best use of public money, the government needs a quick way of classifying a household as poor or not poor. Based on this classification, it can target support to the households that are poor.

Consider this example from Timor-Leste, a small country of 1.4 million people just north of Australia. The poverty rate in Timor-Leste is quite high (42% at last measurement), and the Government needs to target support to poor households. The question is:

How do they go about selecting the poor households?

9.9.2 The Data

Here we will use the Timor-Leste Survey of Living Standards (TLSLS), undertaken in 2014 by the Government Statistics office and supported by the World Bank. This survey covered almost 6,000 households from across the country, carefully chosen to represent all households in the country by using random selection and other more sophisticated methods.

This survey is very thorough: households are interviewed intensively, over almost 2 days, and asked to record all their food consumption and other spending for the past week. Based on this comprehensive data, data analysts can come up with a definite conclusion about whether the household is Poor or Not Poor. The concept of poverty that is used is known as Consumption Poverty: if a household’s consumption is below a certain value (the Poverty Line), the household is classified as poor.

In total, we have a big dataset of almost 6,000 households. For each household, we know whether they are poor or not, and we also know a large range of characteristics of these households – their location, the family composition, quality of housing, everybody’s level of education, what assets they own, what job the adults have, and even things about people’s health.

In this example, we will only import some of the hundreds of variables in the dataset, the variables we have decided to use in the predictive model.

9.9.3 Download the data

The data set is stored in an R data format file named timor.RDS. Note: this is not an Excel file, so you won’t be able to open it in Excel. Begin by downloading the file and moving it to your working directory.

9.9.4 Load the data in RStudio

Next, let us load the data into our working environment in RStudio. Because the data file is a .RDS file, we will use the readRDS() function. In the code below, I am saving the file as ‘timor’.

R Code
timor <- readRDS("timor.RDS")

9.9.5 Load the tidyverse

Use the library() function to load the tidyverse package. This will provide us with all of the tools we need to complete this exercise.

R Code
library(tidyverse)

9.9.6 Inspect the data

Run the str() function to display the structure of the data frame:

R Code
str(timor)
'data.frame':   5916 obs. of  5 variables:
 $ poor     : num  0 1 0 0 0 1 1 0 0 0 ...
 $ hhsize   : int  7 7 8 2 4 10 5 1 6 4 ...
 $ district : chr  "Aileu" "Baucau" "Dili" "Viqueque" ...
 $ urban    : num  1 0 1 0 0 1 0 0 1 0 ...
 $ dirtfloor: num  0 0 0 1 0 0 1 1 1 1 ...

There are 5916 observations and five variables:

  • poor: whether a household officially classifies as poor; 1 indicates poor, and 0 otherwise
  • hhsize: household size
  • district: the district where the household is located
  • urban: whether a household is from an urban area; 1 indicates yes, and 0 otherwise (rural)
  • dirtfloor: whether a household has a dirt floor; 1 indicates yes and 0 otherwise

9.9.7 Training and Testing sets

Let’s now create the model that can be used to predict whether a household is poor, based on these four characteristics.

First, we will split the data into two subsets: a training set and a test set. There are 5916 households in the data set. It’s common practice to split your data into training and testing sets at random. In this exercise, however, we will simply use the first 70% of households (4141) as training data, and the remaining 30% (1775) for testing.

We can use the slice() function to slice our data into parts. Remember that our data set has 5916 households. Therefore:

  • House 1 to House 4141 will be allocated to the training set
  • House 4142 to House 5916 will be allocated to the testing set
R Code
train <- timor |> slice(1:4141)
test <- timor |> slice(4142:5916)

To make sure you have done this correctly, check that you have two objects (train and test) stored in your global environment.
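
If you would rather split the data at random, one possible approach is sketched below. It uses a small made-up data frame (toy) in place of timor, and the names toy, train_idx, toy_train and toy_test are my own, chosen only for illustration. The call to set.seed() makes the random split repeatable.

```r
# Toy stand-in for the timor data frame (made-up values, for illustration only)
toy <- data.frame(id   = 1:10,
                  poor = c(0, 1, 0, 0, 1, 1, 0, 0, 1, 0))

set.seed(123)  # fix the random numbers so the split can be reproduced

# Randomly choose 70% of the row numbers for the training set
n_train   <- floor(0.7 * nrow(toy))
train_idx <- sample(nrow(toy), n_train)

toy_train <- toy[train_idx, ]   # rows that were sampled
toy_test  <- toy[-train_idx, ]  # all remaining rows

nrow(toy_train)  # 7
nrow(toy_test)   # 3
```

The same idea would apply to the full data set by replacing toy with timor. We keep the simple first-70% split in this exercise so that everyone gets identical results.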

9.9.8 Building the model

Let’s now build a model on the training data. Remember our goal here is to build a model to predict whether or not a household might be poor (the poor variable) based on the 4 variables we have available.

R Code
timor_model <- lm(poor ~ hhsize + district + urban + dirtfloor, train)

We can now use the summary() function to inspect this model:

R Code
summary(timor_model)

Call:
lm(formula = poor ~ hhsize + district + urban + dirtfloor, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.53529 -0.29741 -0.09162  0.34359  1.08261 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -0.299783   0.033717  -8.891  < 2e-16 ***
hhsize            0.084698   0.002439  34.726  < 2e-16 ***
districtAinaro    0.057872   0.037928   1.526 0.127126    
districtBaucau   -0.001921   0.035062  -0.055 0.956303    
districtBobonaro  0.209155   0.035955   5.817 6.44e-09 ***
districtCovalima  0.263496   0.037139   7.095 1.52e-12 ***
districtDili      0.056406   0.033856   1.666 0.095777 .  
districtErmera    0.110362   0.036710   3.006 0.002660 ** 
districtLautem   -0.005024   0.037320  -0.135 0.892911    
districtLiquiça   0.143246   0.037530   3.817 0.000137 ***
districtManatuto  0.061082   0.037881   1.612 0.106935    
districtManufahi  0.146757   0.037924   3.870 0.000111 ***
districtOecussi   0.323493   0.036437   8.878  < 2e-16 ***
districtViqueque -0.019185   0.036352  -0.528 0.597693    
urban            -0.116596   0.014212  -8.204 3.07e-16 ***
dirtfloor         0.196339   0.013642  14.393  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3958 on 4125 degrees of freedom
Multiple R-squared:  0.2867,    Adjusted R-squared:  0.2841 
F-statistic: 110.5 on 15 and 4125 DF,  p-value: < 2.2e-16
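
To see how these coefficients turn into predicted probabilities, here is a small worked sketch using R as a calculator. The two households are hypothetical examples of my own, and the coefficient values are copied by hand from the output above. Aileu does not appear in the coefficient table, so it is taken to be the baseline district.

```r
# Coefficients copied from the summary() output above
b0        <- -0.299783   # intercept
b_hhsize  <-  0.084698   # per additional household member
b_oecussi <-  0.323493   # Oecussi compared to the baseline district
b_urban   <- -0.116596   # urban compared to rural
b_dirt    <-  0.196339   # dirt floor compared to no dirt floor

# Hypothetical household A: 7 people, Oecussi, rural, dirt floor
p_a <- b0 + b_hhsize * 7 + b_oecussi + b_urban * 0 + b_dirt * 1
round(p_a, 3)  # 0.813

# Hypothetical household B: 1 person, baseline district, urban, no dirt floor
p_b <- b0 + b_hhsize * 1 + b_urban * 1 + b_dirt * 0
round(p_b, 3)  # -0.332
```

Household A has a predicted probability of about 81% of being poor. Household B has a negative predicted probability, which illustrates the main weakness of the LPM: fitted values are not guaranteed to stay between 0 and 1.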

9.9.9 Making predictions

With our model, we can now generate a prediction for each household. Each prediction is the model’s estimated probability that the household is poor (poor = 1). Most values will fall between 0 and 1, but because this is a linear probability model, some predictions can fall below 0 or above 1. To generate the predictions, we can use the predict() function.

You will need to pass in two arguments to run the predict function:

  1. The object (in this case: our model from the previous step)
  2. The data we are predicting on (in this case: the test data)

In the code below, we are storing the predictions (one for each observation in the test set) as an object called ‘predictions’.

R Code
predictions <- predict(timor_model, test)
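
Before classifying, it can be worth checking how many predictions fall outside the 0 to 1 range. The sketch below shows one way to do this on a tiny made-up data set; the objects toy_train, toy_test, toy_model and toy_preds are hypothetical and are not part of the Timor-Leste data.

```r
# Made-up training and test data, for illustration only
toy_train <- data.frame(poor   = c(0, 0, 0, 1, 0, 1, 1, 1),
                        hhsize = c(1, 2, 3, 4, 5, 6, 7, 8))
toy_test  <- data.frame(hhsize = c(1, 4, 9))

# Fit a linear probability model and predict on the test data
toy_model <- lm(poor ~ hhsize, data = toy_train)
toy_preds <- predict(toy_model, newdata = toy_test)

range(toy_preds)                     # smallest and largest prediction
sum(toy_preds < 0 | toy_preds > 1)   # number of predictions outside 0 to 1
```

In this toy example, two of the three predictions land outside the valid range. Running the same two checks on the predictions object would show whether this also happens for the Timor-Leste model.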

9.9.10 Combining Actual and Predictions

We can now create a data set that combines our actual values (the poor variable) with the predictions we generated in the previous step. There are a number of ways to do this, and in this example we will use the mutate() function. We can read the code below in plain English as:

  1. Start with the test data
  2. Create a new variable called preds based on the predictions object
  3. Create a new variable called class that classifies the predictions into either ‘poor’ or ‘not poor’ using a threshold of 0.50.
  4. Create a new variable called actual that provides text labels (instead of 0s and 1s) for the poor variable.

I’ll also save this as ‘res’.

R Code
res <- 
  test |> 
  mutate(preds = predictions) |> 
  mutate(class = ifelse(preds > 0.5, "poor", "not poor")) |> 
  mutate(actual = ifelse(poor == 1, "poor", "not poor"))

9.9.11 Confusion Matrix

We can now use the table() function to create a confusion matrix of our actual and predicted classifications:

R Code
table(Prediction = res$class, Actual = res$actual)
          Actual
Prediction not poor poor
  not poor     1047  313
  poor          128  287

9.9.12 Reorder factor levels (optional)

Note that in the confusion matrix above the negative class (not poor) appears before the positive class (poor). This is because the table() function orders the labels alphabetically. If we want the table the other way around, we need to reorder the levels of the two variables. We can do this with the fct_relevel() function.

In the code below, I create a new data frame (called res2) which is the same as the previous one, just with the levels reordered so that ‘poor’ goes before ‘not poor’.

R Code
res2 <- 
  res |> 
  mutate(actual = fct_relevel(actual, "poor", "not poor"),
         class = fct_relevel(class, "poor", "not poor"))

We can now rerun the table() function with the new data (res2):

R Code
table(Prediction = res2$class, Actual = res2$actual)
          Actual
Prediction poor not poor
  poor      287      128
  not poor  313     1047

9.9.13 Calculating FPR / FNR

From our table, we can determine:

  • True Positive = 287
  • True Negative = 1047
  • False Positive = 128
  • False Negative = 313

And, we could use R as a calculator to determine various metrics, for example:

R Code
# Overall Accuracy
(287+1047) / (287+313+128+1047)
[1] 0.7515493
# Overall Error
(128+313) / (287+313+128+1047)
[1] 0.2484507
# FPR (Type I Error)
128/(128+1047)
[1] 0.1089362
# FNR (Type II Error)
313/(313+287)
[1] 0.5216667

In practice, it would be better to store these values as objects in our global environment. For example we can save the confusion matrix as an object (in the code below saved as tbl) and then extract out each cell of the table (based upon their position within the table).

R Code
tbl <- table(Prediction = res2$class, Actual = res2$actual)

# extract the top left cell and save it as 'TP'
TP <- tbl[1]

# extract the bottom left cell and save it as 'FN'
FN <- tbl[2]

# extract the top right cell and save it as 'FP'
FP <- tbl[3]

# extract the bottom right cell and save it as 'TN'
TN <- tbl[4]
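
The positions 1 to 4 only give the right cells when the table is laid out exactly as above (R counts down the first column, then down the second). A slightly safer alternative, sketched below, is to pick out each cell by its row and column labels. To keep the example self-contained, the table is rebuilt by hand from the counts shown earlier rather than from res2.

```r
# Rebuild the confusion matrix by hand from the counts above
# (matrix() fills column by column: actual poor first, then actual not poor)
tbl <- as.table(matrix(c(287, 313, 128, 1047), nrow = 2,
                       dimnames = list(Prediction = c("poor", "not poor"),
                                       Actual     = c("poor", "not poor"))))

# Extract each cell as tbl[predicted label, actual label]
TP <- tbl["poor", "poor"]
FN <- tbl["not poor", "poor"]
FP <- tbl["poor", "not poor"]
TN <- tbl["not poor", "not poor"]

c(TP = TP, FN = FN, FP = FP, TN = TN)
```

Because the cells are requested by name, this version would still return the correct counts even if the rows and columns of the table were in a different order.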

Now, we can set up our formulas to calculate the different metrics using the stored objects, rather than individual numbers. Check that these values match what we had previously.

R Code
# Overall Accuracy
(TP + TN) / (TP + FP + FN + TN)
[1] 0.7515493
# Overall Error
(FP + FN) / (TP + FP + FN + TN)
[1] 0.2484507
# FPR (Type I Error)
FP/(FP+TN)
[1] 0.1089362
# FNR (Type II Error)
FN/(FN+TP)
[1] 0.5216667

But why is this important?

From a coding perspective, the key advantage of setting things up this way is that we don’t need to manually re-enter numbers every time we change the threshold.

Because the calculations are written in terms of TP, FP, FN, and TN, the code does not change — only the threshold does. This makes the analysis:

  • reproducible,
  • less error-prone,
  • and easy to extend (e.g. looping over many thresholds to see how FPR and FNR change).

As an example, I’ve copied and pasted the same code from earlier into the section below. The only difference is that I’ve changed the value of the threshold to 0.40.

R Code
res <- 
  test |> 
  mutate(preds = predictions) |> 
  mutate(class = ifelse(preds > 0.4, "poor", "not poor")) |> #Threshold changed
  mutate(actual = ifelse(poor == 1, "poor", "not poor"))

res2 <- 
  res |> 
  mutate(actual = fct_relevel(actual, "poor", "not poor"),
         class = fct_relevel(class, "poor", "not poor"))

tbl <- table(Prediction = res2$class, Actual = res2$actual)
TP <- tbl[1]
FN <- tbl[2]
FP <- tbl[3]
TN <- tbl[4]
# Overall Accuracy
(TP + TN) / (TP + FP + FN + TN)
[1] 0.7616901
# Overall Error
(FP + FN) / (TP + FP + FN + TN)
[1] 0.2383099
# FPR (Type I Error)
FP/(FP+TN)
[1] 0.2025532
# FNR (Type II Error)
FN/(FN+TP)
[1] 0.3083333

Notice how simply changing one value (0.5 to 0.4) and running the code again quickly recomputes these metrics for us. Lowering the threshold raised the FPR from about 0.11 to about 0.20, and lowered the FNR from about 0.52 to about 0.31. This is the advantage of setting up our code in a reproducible format! Try it with different thresholds.
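
If you want to try several thresholds in one go, one option is to wrap the calculations in a function and apply it to a vector of thresholds. The sketch below is only an illustration: the function error_rates() and the vectors probs and actual are names and made-up values of my own, not objects from the Timor-Leste example.

```r
# Made-up predicted probabilities and true outcomes (1 = poor), for illustration only
probs  <- c(0.10, 0.25, 0.35, 0.45, 0.55, 0.65, 0.80, 0.90)
actual <- c(0,    0,    1,    0,    1,    0,    1,    1)

# Hypothetical helper: FPR and FNR for a single threshold
error_rates <- function(threshold, probs, actual) {
  pred <- as.numeric(probs > threshold)   # 1 = classified as poor
  TP <- sum(pred == 1 & actual == 1)
  FN <- sum(pred == 0 & actual == 1)
  FP <- sum(pred == 1 & actual == 0)
  TN <- sum(pred == 0 & actual == 0)
  c(threshold = threshold, FPR = FP / (FP + TN), FNR = FN / (FN + TP))
}

# Apply the helper to each threshold; each row of the result is one threshold
thresholds <- c(0.3, 0.4, 0.5, 0.6)
results <- t(sapply(thresholds, error_rates, probs = probs, actual = actual))
results
```

In this toy example the FPR falls and the FNR rises as the threshold increases, which is the same trade-off we saw when moving between 0.5 and 0.4 above. With the real data, probs would be replaced by predictions and actual by test$poor.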

9.10 Summary

In this chapter, we extended the modelling ideas you already know from linear regression into the world of classification, where the outcome is not a number but a category (often coded as 0/1). While the goal changes—from predicting a continuous value like margin or income to predicting a class like win/loss or poor/not poor—the workflow remains familiar: we select predictors, fit a model using training data, and then evaluate how well that model performs on new observations.

We also introduced an important mindset shift: explain versus predict. Explanatory modelling focuses on interpreting coefficients to understand relationships, whereas predictive modelling focuses on generating accurate forecasts for unseen cases. This distinction becomes especially important in classification because a model can look “good” on the data it was trained on, but still perform poorly on new data. That is why we emphasised splitting data into training and testing sets, and evaluating performance using the test set.

To take our first step into classification modelling, we used a familiar tool in an unfamiliar way: the linear probability model (LPM). By coding outcomes as 0 and 1, the fitted values from a linear regression can be interpreted as predicted probabilities. This makes the LPM intuitive, because coefficients can be read as changes in probability (e.g., a one-unit increase in a predictor increases the chance of being in class 1 by some number of percentage points). At the same time, we highlighted why the LPM has limitations—most notably that it can produce probabilities below 0 or above 1—motivating the need for logistic regression in the next chapter.

Finally, we focused on how probabilities become decisions. A predicted probability is useful, but many practical problems require a hard classification—so we choose a threshold (such as 0.5) to convert probabilities into predicted classes. Changing the threshold does not change the model, but it does change the confusion matrix and therefore the balance of false positives (Type I errors) and false negatives (Type II errors). This is why we used confusion matrices and calculated metrics like FPR and FNR: they help us understand what kind of mistakes a model is making, not just how often it is wrong. From a coding perspective, once we store the confusion matrix counts (TP, FP, FN, TN) as objects, we can change thresholds quickly and recompute performance without retyping numbers—making the whole workflow reproducible and scalable.

9.11 Exercises

Question 1

In a linear probability model (LPM) where the outcome is coded 0/1, what does a fitted value \(\hat{y}\) represent, and how do we interpret a slope coefficient?

In an LPM, the fitted value \(\hat{y}\) is interpreted as an estimated probability that the outcome equals 1 (e.g., \(P(Y=1)\)). A slope coefficient tells us how that probability changes when the predictor increases by one unit. For example, if \(\beta_1=0.02\), then a one-unit increase in the predictor is associated with a 2 percentage-point increase in the predicted probability of \(Y=1\) (holding other predictors constant).

Question 2

Why do we need to choose a threshold when the model outputs predicted probabilities? What is the difference between a probability prediction and a class prediction?

A model often outputs a probability (a number between 0 and 1), which describes how likely an observation is to belong to class 1. But many real decisions require a hard classification (class 0 or class 1). A threshold is the rule that converts probabilities into class labels (e.g., classify as 1 if \(p \ge0.5\)).

Question 3

Suppose we lower the threshold from 0.50 to 0.40. What do you expect to happen to predicted positives, false positives, and false negatives (in general), and why?

Lowering the threshold makes it easier for an observation to be classified as positive (class 1). So:

  • Predicted positives increase (more observations exceed the threshold).
  • False positives tend to increase (more negative cases get labelled as positive).
  • False negatives tend to decrease (fewer positive cases get missed).

This happens because changing the threshold shifts the decision boundary: you are trading off one type of error against the other.

Question 4

Two models have the same overall accuracy. Explain why one model could still be preferred, using false positives vs false negatives.

Accuracy alone hides what kind of mistakes the model is making. Two models can have the same accuracy but very different balances of:

  • False positives (Type I errors): predicting positive when the truth is negative
  • False negatives (Type II errors): predicting negative when the truth is positive

Which model is better depends on the cost of errors in the real context. For example, in a medical screening context, a model with fewer false negatives may be preferred (missing a disease can be severe), even if its overall accuracy is the same as another model.

Question 5

Why do we evaluate classification models on a test set instead of only using the training set? What does overfitting look like in classification?

We evaluate on a test set because it provides an estimate of performance on new, unseen data. A model can perform very well on training data simply because it has learned patterns that are specific to that sample, including noise.

In classification, overfitting looks like:

  • very high training accuracy (or very low training error),
  • but noticeably worse test accuracy (or higher test error).

This indicates the model has not learned general patterns that transfer well to new data.

Question 6

It is commonly believed that young people are much more at risk of being involved in an accident than older drivers. In this exercise you will look at results based on a database with details of every death in road accidents in Australia over the period 1989 to 2020. Begin by downloading the accidents.csv file and loading it into RStudio.


Estimate a regression model that explores the trend in proportion of young person fatalities over time using the following variables:

  • Dependent variable: Age (1 = young, 0 = not young)
  • Independent variable: t (time, where 0 is 1988)
  1. What is the average annual rate of decline in proportion of fatalities that are young people aged 17-25?
  2. Use the model to estimate the proportion of fatalities among young people in 1988.
  3. Use the model to predict the proportion of fatalities among young people in the year 2020

a.

model <- lm(Age ~ t, accidents)
coef(model)
 (Intercept)            t 
 0.431778174 -0.006305964 

The coefficient for t is -0.0063. This tells us that the proportion of fatalities who are young people declines by about 0.63 percentage points per year, on average.

b.

From the output in part a, the equation for this model is:

\[Age=0.4318-0.0063(t)\]

Therefore when t = 0 (which is the year 1988), \(Age=0.4318\). The estimated proportion of fatalities who were young people in 1988 is 43.18%. Note: this is simply the intercept from the output in part a.

c.

If t = 0 is 1988, then t = 32 for the year 2020. Then:

\[Age=0.4318-0.0063(32)=0.23\]

Approximately 23% of fatalities in 2020 are predicted to be young people.
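
The same calculation can be done in R using the unrounded coefficients. This is a minimal sketch in which the two coefficient values are copied by hand from the coef(model) output in part a; the object names b0, b1, t_2020 and pred_2020 are my own.

```r
# Coefficients copied from the coef(model) output in part a
b0 <- 0.431778174    # intercept (estimated proportion when t = 0, i.e. 1988)
b1 <- -0.006305964   # change in proportion per year

t_2020    <- 2020 - 1988        # t = 32
pred_2020 <- b0 + b1 * t_2020   # predicted proportion for 2020
round(pred_2020, 3)             # 0.23
```

If the model object from part a is still in your environment, predict(model, data.frame(t = 32)) should give the same value without copying any numbers by hand.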

Question 7

Continuing from the previous question, estimate a multiple regression model that explores the circumstances where a fatality is more likely to be a young person. Use:

  • Dependent variable: Age (1 = young, 0 = not young)
  • Independent variables:
    • t (time, where 0 is 1988)
    • Male (1 = male, 0 = otherwise)
    • Night (1 = Accident occurred at night, 0 = otherwise)
    • Weekend (1 = Accident occurred on weekend, 0 = otherwise)
    • Single.Vehicle (1 = Accident involved only 1 vehicle, 0 = otherwise)
  1. Compare the R-Square values across the two models (question 6 and 7). Explain what R-Square is measuring, and comment on the differences between the two models.
  2. Explain how the interpretation of the coefficient of t is different in this model compared to the first model.
  3. Despite adding several variables to the model, the coefficient of t is very similar for both models. Can you think of a reason why this might be?
  4. Refer to the p-value column of values, Pr(>|t|), and explain briefly what are the circumstances where the person killed is more likely to be a young person aged under 25.
  5. Interpret the coefficient of “Night”.
  6. Interpret the coefficient of “Weekend”

a.

model2 <- lm(Age ~ t + Male + Night + Weekend + Single.Vehicle, accidents) 
summary(model)$r.squared
[1] 0.01417989
summary(model2)$r.squared
[1] 0.05077797

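In the first model \(R^2\) is 1.42%, and in the second model it is 5.08%. \(R^2\) measures the percentage of variation in Y that is explained by the set of X variables included in the regression model. Adding variables to a model can never decrease \(R^2\), so the second model was always going to have the higher value. Even so, both values are small: most of the variation in whether a fatality is a young person is left unexplained by either model.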

b.

summary(model2)

Call:
lm(formula = Age ~ t + Male + Night + Weekend + Single.Vehicle, 
    data = accidents)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5784 -0.3547 -0.2457  0.5480  0.8817 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.3164657  0.0054129   58.47   <2e-16 ***
t              -0.0061917  0.0002289  -27.05   <2e-16 ***
Male            0.0021519  0.0045819    0.47    0.639    
Night           0.1165304  0.0043378   26.86   <2e-16 ***
Weekend         0.0583264  0.0043115   13.53   <2e-16 ***
Single.Vehicle  0.0911346  0.0042440   21.47   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4624 on 51196 degrees of freedom
Multiple R-squared:  0.05078,   Adjusted R-squared:  0.05069 
F-statistic: 547.7 on 5 and 51196 DF,  p-value: < 2.2e-16

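Here, we hold the other X variables constant. That is, comparing two accidents with the same gender, time of day (day or night), part of the week (weekday or weekend) and number of vehicles (single or multiple), the coefficient of t says that the proportion of fatalities who are young people aged 17-25 declines by an average of 0.62 percentage points per year. In the first model, the coefficient of t describes the overall trend without holding anything else constant.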

c.

The other variables have not changed much over time, and not systematically with age, so leaving them out of the first model does not affect the coefficient of t.

d.

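A fatality is significantly more likely to be a young person when the accident occurs at night, on a weekend, or involves a single vehicle (all three p-values are below 2e-16). The coefficient of Male is not statistically significant (p = 0.639).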

e.

If an accident occurs at night time, the estimated chance that the person killed is a young person is 11.65 percentage points higher than in the daytime, holding other variables constant.

f.

If an accident occurs on a weekend, the estimated chance that the person killed is a young person is 5.83 percentage points higher than on a weekday, holding other variables constant.

Question 8

In this exercise you will look at the issue of the “Youth Bulge” in a neighbouring developing country, Timor-Leste. The Youth Bulge describes a situation where there are many young people completing education and looking for work, but very few jobs for them. Your focus will be on young people aged 25-29, the age where most have finished their education, and in the early stages of their employment.

Based on the 2015 Census, each of these young people has been put into one of 5 categories:

  • Formal: working in a formal sector job
  • Farmer: the person’s main work is as a self-employed farmer
  • Infnonfarm: the person’s main work is as a self-employed informal sector activity
  • Unemployed: the person is unemployed (doesn’t have a job, and wants a job)
  • Not in Labour Force: the person is not working or looking for work – mostly students or fulltime parents

The Census also provides the following information about each person in the dataset:

  • English = 1 if the person can read, write and speak English
  • Primary = 1 if the highest level of education achieved is completing primary school
  • Secondary = 1 if the highest level of education achieved is completing secondary school
  • HigherEd = 1 if the highest level of education achieved is completing higher education
  • Male = 1 if the person reports as male
  • Married = 1 if the person is married
  • Mother = 1 if the person is a mother
  • Dili = 1 if the person lives in the capital city, Dili
  • Numdis = the number of disabilities the person has (people can have up to 4 disabilities: seeing, hearing, mobility, mental)

First, download and load the data from the file “youthbulge.csv” into R and save it in a data frame labelled youthbulge.


Estimate a multiple regression model with the dependent variable being formal, which takes the value 1 if the person has a formal sector job, and equals 0 if they do not. Include english, primary, secondary, highered, male, married, mother, dili and numdis as the independent variables.

  1. What do you learn from the p-values and coefficients on the 3 education dummy variables? (NB: remember the base category is a person with no education, or only some primary education.) Explain carefully.

  2. Based on the model results, how much more likely is a married male to have a formal sector job compared to an unmarried male?

  3. Based on the model results, compare the likelihood of having a formal job for a single female (not a mother) with that likelihood for a married female who is a mother. Explain your calculations briefly.

  4. Using the coefficient of Numdis, what is the difference in likelihood of having a formal job between a person with 2 disabilities and a person with no disabilities, if all their other characteristics are the same?

  5. Use the model to predict the probability of formal employment for two types of people (just approximate prediction is fine, say 2 decimal places):

    • an unmarried young woman with no children, no education, cannot speak English, has one disability, and lives outside Dili.
    • a married male with higher education qualifications, and can speak English, has no disabilities and lives in Dili.

    Comment on the differences in probabilities you find here.

  6. A famous Timorese politician recently said “Yes, 27% of young males have formal sector jobs and only 16% of females, but this is not because of sexism. It is because males usually have higher levels of education. If we make the education system fairer, that will eliminate the gender differences in employment”. Use the multiple regression results here to carefully critique this claim. Explain your reasoning.

model1 <- lm(formal ~ english + primary + secondary + highered + male + married
+ mother + dili + numdis, youthbulge)

summary(model1)

Call:
lm(formula = formal ~ english + primary + secondary + highered + 
    male + married + mother + dili + numdis, data = youthbulge)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.51661 -0.25630 -0.13027 -0.03428  1.07519 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.034282   0.004240   8.086 6.22e-16 ***
english      0.082441   0.003294  25.024  < 2e-16 ***
primary      0.007282   0.004308   1.690   0.0910 .  
secondary    0.066413   0.003774  17.600  < 2e-16 ***
highered     0.098652   0.004724  20.882  < 2e-16 ***
male         0.059929   0.003622  16.545  < 2e-16 ***
married      0.115282   0.003059  37.689  < 2e-16 ***
mother      -0.085704   0.004304 -19.913  < 2e-16 ***
dili         0.126026   0.002926  43.073  < 2e-16 ***
numdis      -0.023768   0.011718  -2.028   0.0425 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3932 on 92393 degrees of freedom
Multiple R-squared:  0.08185,   Adjusted R-squared:  0.08176 
F-statistic: 915.2 on 9 and 92393 DF,  p-value: < 2.2e-16

a.

The p-values are significant for Secondary and HigherEd, but not for Primary at the 5% level (p = 0.091). So, the evidence is that having Secondary or HigherEd significantly increases the chances of having a Formal sector job, but there is no strong evidence that Primary makes you more likely to have a Formal sector job than no education.

This is reinforced by the fact that the coefficient on Primary is very small: even if we had found significant evidence that primary education increases Pr(Formal), the estimated difference is only 0.7 percentage points.

A person with secondary education has a 6.6 percentage point greater chance of having a formal sector job compared to someone with no education or less than primary school.

A person with HigherEd has a 9.9 percentage point greater chance of having a formal sector job compared to someone with no education or less than primary school.

Both have strong positive effects on the chances of a formal sector job, but the effect of HigherEd is clearly larger.

b.

11.5 percentage points (the coefficient on married), assuming other characteristics are the same.

c.

Coefficient of Married + coefficient of Mother = 0.115 - 0.086 = 0.029. So, a married mother is 2.9 percentage points more likely to have a formal sector job compared to a single female who is not a mother.

d.

-0.024 × 2 = -0.048. A person with 2 disabilities is 4.8 percentage points less likely to have a formal sector job, compared to someone with no disabilities with the same other characteristics.

e.

  • Unmarried woman: Pr(Formal) = 0.034 - 0.024 = 0.01, a 1% chance
  • Married male, etc.: Pr(Formal) = 0.034 + 0.082 + 0.099 + 0.060 + 0.115 + 0.126 = 0.52 approximately, or 52%.

Huge difference: 52% compared to 1%. The female has multiple disadvantages that accumulate. The male is the opposite: he has all the benefits of location, education, no disability, gender, etc. This highlights how unequal the opportunities are.
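
As a check on the arithmetic, the two predictions can also be computed in R by multiplying each coefficient by the person’s value for that variable and adding up the results. This is only a sketch: the coefficients are copied by hand from the summary(model1) output, and the objects coefs, person_a and person_b are names of my own.

```r
# Coefficients copied from the summary(model1) output
coefs <- c(intercept = 0.034282, english = 0.082441, primary = 0.007282,
           secondary = 0.066413, highered = 0.098652, male = 0.059929,
           married = 0.115282, mother = -0.085704, dili = 0.126026,
           numdis = -0.023768)

# Each person's values, in the same order as coefs (the intercept is always 1)
# Person A: unmarried woman, no children, no education, no English,
#           one disability, lives outside Dili
person_a <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 1)
# Person B: married male, higher education, speaks English,
#           no disabilities, lives in Dili
person_b <- c(1, 1, 0, 0, 1, 1, 1, 0, 1, 0)

p_a <- sum(coefs * person_a)
p_b <- sum(coefs * person_b)
round(c(p_a, p_b), 4)
```

Using the unrounded coefficients, the predicted probabilities come out at roughly 0.011 and 0.517.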

f.

The coefficient of Male is 0.06, so if we compare a male and a female with the same other characteristics, including education level, the male is 6 percentage points more likely than the female to have a formal sector job. This is a statistically significant difference (small p-value).

So, this contradicts the politician’s claim. Even when we compare people with the same education level, males are still much more likely to have a formal sector job. So, eliminating differences in educational attainment won’t eliminate the gender differences.

Question 9

The data file below contains a simulated data set containing information on 1,645 customers. The aim here is to predict which customers will default on their credit card debt.

  • default: A factor with levels 0 = No and 1 = Yes indicating whether the customer defaulted on their debt
  • student: A factor with levels No and Yes indicating whether the customer is a student
  • balance: The average balance that the customer has remaining on their credit card after making their monthly payment
  • income: Income of customer


  1. Download and load the data file into R as ‘default’.
  2. Split the data into training and testing sets. Use the first 1000 observations for training and the remaining for testing.
  3. Construct an LPM for default with student, balance and income as independent variables.
  4. Use your model to make predictions on the test set.
  5. Use a threshold of 0.40 to classify your predictions into “No” and “Yes” for defaulting.
  6. Construct a confusion matrix for this scenario.
  7. Using your confusion matrix, compute the False Positive and False Negative rates.
# a. 
default <- read.csv("data/Default.csv")

# b.
train <- default |> slice(1:1000)
test <- default |> slice(1001:1645)

# c.
model_default <- lm(default ~ student + balance + income, train)

# d.
class_pred <- predict(model_default, test)

# e. 
res <- test |> 
  mutate(predictions = class_pred) |> 
  
  mutate(class = ifelse(predictions < 0.40, "No", "Yes"),
         actual = ifelse(default == 0, "No", "Yes")) |> 
  
  mutate(class = fct_relevel(class,"Yes","No"),
         actual = fct_relevel(actual,"Yes","No"))

# f.
tbl <- table(Predictions = res$class, Actual = res$actual)

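# g.
# With "Yes" as the first level, the cells are read column by column:
TP <- tbl[1]  # predicted Yes, actual Yes
FN <- tbl[2]  # predicted No, actual Yes
FP <- tbl[3]  # predicted Yes, actual No
TN <- tbl[4]  # predicted No, actual No

FPR <- FP/(FP+TN)
FNR <- FN/(FN+TP)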